598 research outputs found

    Unsupervised Keyword Extraction from Polish Legal Texts

    In this work, we present an application of the recently proposed unsupervised keyword extraction algorithm RAKE to a corpus of Polish legal texts from the field of public procurement. RAKE is essentially a language- and domain-independent method. Its only language-specific input is a stoplist containing a set of non-content words. The performance of the method depends heavily on the choice of this stoplist, which should be adapted to the domain. Therefore, we complement the RAKE algorithm with an automatic approach to selecting non-content words, based on the statistical properties of term distribution.
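
    A minimal sketch of the idea (not the paper's exact procedure): a stoplist is derived from corpus statistics, here using a simple document-frequency threshold as a stand-in for the term-distribution criterion, and is then fed to a small RAKE-style extractor that splits candidate phrases at non-content words. The example sentences and the df_threshold value are invented for illustration.

```python
# Sketch only: document-frequency stoplist + RAKE-style phrase scoring.
import re
from collections import Counter

def build_stoplist(corpus, df_threshold=0.5):
    """Treat words appearing in more than df_threshold of documents as non-content."""
    df = Counter()
    for doc in corpus:
        df.update(set(re.findall(r"\w+", doc.lower())))
    return {w for w, c in df.items() if c / len(corpus) > df_threshold}

def rake_keywords(text, stoplist, top_k=5):
    """RAKE scoring: split candidate phrases at stopwords/punctuation,
    score words by degree/frequency, score phrases by the sum of word scores."""
    tokens = re.findall(r"\w+|[.,;:!?]", text.lower())
    phrases, current = [], []
    for tok in tokens:
        if tok in stoplist or not tok.isalnum():
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        phrases.append(current)

    freq, degree = Counter(), Counter()
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase) - 1   # co-occurrence degree, excluding self
    word_score = {w: (degree[w] + freq[w]) / freq[w] for w in freq}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda x: -x[1])[:top_k]

corpus = ["the contracting authority shall publish the contract notice",
          "the tender documents shall specify the award criteria"]
stoplist = build_stoplist(corpus)          # picks up "the", "shall" here
print(rake_keywords(corpus[0], stoplist))
```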

    Transforming legal documents for visualization and analysis

    Regulations, laws, norms, and other documents of a legal nature are a relevant part of any governmental organisation. During the digitisation and transformation stages towards a digital government model, information and communication technologies are explored to improve the internal processes and working practices of government infrastructures. This paper introduces preliminary results on a research line devoted to developing visualisation techniques for enhancing the readability and comprehension of legal texts. The content of documents is mapped to a well-defined model, which is enriched with automatically extracted semantic information. Then, a set of digital views is created for exploring documents from both a structural and a semantic point of view. Effective and easier-to-use digital interfaces can enable and promote citizens' engagement in decision-making processes, provide information for the public, and also enhance the study and analysis of legal texts by lawmakers, legal practitioners, and assorted scholars. “SmartEGOV: Harnessing EGOV for Smart Governance (Foundations, Methods, Tools) / NORTE-01-0145-FEDER-000037”, supported by Norte Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF).

    Query Expansion Using PRF-CBD Approach for Documents Retrieval


    Classification of protein interaction sentences via gaussian processes

    The increase in the availability of protein interaction studies in textual format, coupled with the demand for easier access to the key results, has led to a need for text mining solutions. In the text processing pipeline, classification is a key step for the extraction of small sections of relevant text. Consequently, for the task of locating protein-protein interaction sentences, we examine the use of a classifier that has rarely been applied to text: Gaussian processes (GPs). GPs are a non-parametric probabilistic analogue to the more popular support vector machines (SVMs). We find that GPs outperform the SVM and naïve Bayes classifiers on binary sentence data, whilst showing equivalent performance on abstract and multiclass sentence corpora. In addition, the lack of a margin parameter, which requires costly tuning, along with the principled multiclass extensions enabled by the probabilistic framework, makes GPs an appealing alternative worthy of further adoption.
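
    As a hedged illustration of the comparison described above (not the paper's corpora, features, or kernels), the sketch below trains scikit-learn's GaussianProcessClassifier and an SVM on a toy set of TF-IDF-encoded sentences. The sentences and labels are invented, and the dense conversion of the TF-IDF matrix is only practical for small corpora.

```python
# Sketch only: GP vs. SVM on a tiny binary "interaction sentence" task.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVC

sentences = [
    "Protein A binds directly to protein B in vitro.",       # interaction
    "BRCA1 interacts with RAD51 during DNA repair.",          # interaction
    "The cells were cultured for 48 hours at 37 degrees.",    # no interaction
    "Samples were centrifuged and stored at -80 C.",          # no interaction
]
labels = [1, 1, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(sentences).toarray()   # GPC expects dense input

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, labels)
svm = SVC(probability=True).fit(X, labels)   # the SVM's C would normally need tuning

test = vec.transform(["Protein X binds protein Y."]).toarray()
print("GP  P(interaction):", gp.predict_proba(test)[0, 1])
print("SVM P(interaction):", svm.predict_proba(test)[0, 1])
```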

    Semi-automated dialogue act classification for situated social agents in games

    As a step toward simulating dynamic dialogue between agents and humans in virtual environments, we describe learning a model of social behavior composed of interleaved utterances and physical actions. In our model, utterances are abstracted as {speech act, propositional content, referent} triples. After training a classifier on 100 gameplay logs from The Restaurant Game annotated with dialogue act triples, we have automatically classified utterances in an additional 5,000 logs. A quantitative evaluation of statistical models learned from the gameplay logs demonstrates that semi-automatically classified dialogue acts yield significantly more predictive power than automatically clustered utterances, and serve as a better common currency for modeling interleaved actions and utterances.
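
    A minimal, hypothetical sketch of the classification step (not the annotation scheme or classifier used for The Restaurant Game logs): predicting only the speech-act component of a {speech act, propositional content, referent} triple from utterance text. All utterances and labels below are invented.

```python
# Sketch only: text classifier for the speech-act part of a dialogue act triple.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

utterances = [
    "Can I get the steak please?",          # request
    "Here is your drink.",                  # inform
    "Thanks, that was great!",              # thank
    "I'd like a table for two.",            # request
    "Your bill comes to twenty dollars.",   # inform
    "Thank you so much for coming.",        # thank
]
speech_acts = ["request", "inform", "thank", "request", "inform", "thank"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
clf.fit(utterances, speech_acts)
print(clf.predict(["Could I have the salad?"]))  # expected: 'request'
```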

    Multimedia Retrieval by Means of Merge of Results from Textual and Content Based Retrieval Subsystems

    The main goal of this paper is to present our experiments in the ImageCLEF 2009 campaign (photo retrieval task). In 2008 we showed empirically that Text-Based Image Retrieval (TBIR) methods beat Content-Based Image Retrieval (CBIR) in the quality of results, so this time we developed several experiments in which the CBIR helps the TBIR. The main improvement of the TBIR system [6] is the named-entity sub-module. In the case of the CBIR system [3], the number of low-level features has been increased from the 68 components used at ImageCLEF 2008 up to 114 components, and only the Mahalanobis distance has been used. We propose an ad-hoc management of the delivered topics, and the generation of XML structures for the 0.5 million photograph captions (corpus) delivered. Two different merging algorithms were developed, and a third one tries to improve our previous cluster-level results by promoting diversity. Our best run ranked 16th for the precision metrics, 19th for the MAP score, and 11th for the diversity value, out of a total of 84 submitted experiments. Our best text-only experiment ranked 6th out of 41.
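
    The sketch below shows one generic way such a merge can work: a linear combination of min-max-normalised scores from the textual and content-based subsystems. It is an illustration of score fusion in general, not the specific merging algorithms submitted to ImageCLEF; the document identifiers, scores, and alpha weight are invented.

```python
# Sketch only: weighted fusion of TBIR and CBIR ranked lists.
def normalise(results):
    """Min-max normalise a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(results.values()), max(results.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in results.items()}

def merge(tbir, cbir, alpha=0.8):
    """Weighted fusion; alpha > 0.5 lets the text-based scores dominate."""
    tbir, cbir = normalise(tbir), normalise(cbir)
    docs = set(tbir) | set(cbir)
    fused = {d: alpha * tbir.get(d, 0.0) + (1 - alpha) * cbir.get(d, 0.0)
             for d in docs}
    return sorted(fused.items(), key=lambda x: -x[1])

tbir_scores = {"img_01": 12.4, "img_07": 9.8, "img_03": 5.1}
cbir_scores = {"img_07": 0.92, "img_05": 0.88, "img_01": 0.40}
print(merge(tbir_scores, cbir_scores))
```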

    Document Word Clouds: Visualising Web Documents as Tag Clouds to Aid Users in Relevance Decisions

    Information Retrieval systems spend great effort on determining the significant terms in a document. When, instead, a user is looking at a document, he cannot benefit from such information: he has to read the text to understand which words are important. In this paper we take a look at the idea of enhancing the perception of web documents with visualisation techniques borrowed from the tag clouds of Web 2.0. Highlighting the important words in a document by using a larger font size allows users to get a quick impression of the relevant concepts in a text. As this process does not depend on a user query, it can also be used for explorative search. A user study showed that even simple TF-IDF values, used as a notion of word importance, helped users decide more quickly whether or not a document is relevant to a topic.
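
    A minimal sketch of the underlying idea: compute TF-IDF weights for the terms of a document against a small background corpus and map each weight to a font size, so the important words can be rendered larger. The corpus, document, and point-size range are invented, and the output is printed as simple HTML-like span tags rather than an actual rendering.

```python
# Sketch only: TF-IDF weights mapped to font sizes for a document word cloud.
from sklearn.feature_extraction.text import TfidfVectorizer

background = [
    "the cat sat on the mat",
    "stock markets fell sharply amid inflation fears",
    "the football match ended in a draw",
]
document = "stock markets rallied as inflation data eased investor fears"

vec = TfidfVectorizer()
vec.fit(background + [document])
weights = dict(zip(vec.get_feature_names_out(),
                   vec.transform([document]).toarray()[0]))

min_pt, max_pt = 10, 32
top = max(weights.values()) or 1.0
for word in document.split():
    w = weights.get(word, 0.0)
    size = int(min_pt + (max_pt - min_pt) * w / top)
    print(f'<span style="font-size:{size}pt">{word}</span>')
```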

    Extracting collective trends from Twitter using social-based data mining

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-40495-5_62 (Proceedings of the 5th International Conference, ICCCI 2013, Craiova, Romania, September 11-13, 2013). Social Networks have become an important environment for the extraction of collective trends. The interactions amongst users provide information about their preferences and relationships. This information can be used to measure the influence of ideas or opinions, and how they are spread within the Network. Currently, one of the most relevant and popular Social Networks is Twitter. This Social Network was created to share comments and opinions. The information provided by users is especially useful in different fields and research areas such as marketing. This data is presented as short text strings containing different ideas expressed by real people. With this representation, different Data Mining and Text Mining techniques (such as classification and clustering) can be used for knowledge extraction, trying to distinguish the meaning of the opinions. This work focuses on how these techniques can interpret these opinions within the Social Network, using information related to the IKEA® company. The preparation of this manuscript has been supported by the Spanish Ministry of Science and Innovation under the following projects: TIN2010-19872, ECO2011-30105 (National Plan for Research, Development and Innovation) and the Multidisciplinary Project of Universidad Autónoma de Madrid (CEMU-2012-034).
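
    As a hedged illustration of the clustering side of such an analysis (not the paper's dataset or method), the sketch below groups a few invented IKEA-related opinion texts using TF-IDF features and k-means.

```python
# Sketch only: clustering short opinion texts by TF-IDF similarity.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = [
    "Love my new IKEA bookshelf, so easy to assemble",
    "IKEA furniture is great value for money",
    "Spent three hours assembling an IKEA wardrobe, never again",
    "Missing screws in my IKEA delivery, really annoyed",
]
X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for tweet, label in zip(tweets, labels):
    print(label, tweet)
```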

    From text summarisation to style-specific summarisation for broadcast news

    In this paper we report on a series of experiments investigating the path from text summarisation to style-specific summarisation of spoken news stories. We show that the portability of traditional text summarisation features to broadcast news is dependent on the diffusiveness of the information in the broadcast news story. An analysis of two categories of news stories (containing only read speech or including some spontaneous speech) demonstrates the importance of the style and the quality of the transcript when extracting the summary-worthy information content. Further experiments indicate the advantages of doing style-specific summarisation of broadcast news.
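
    A minimal sketch of a classic extractive summarisation baseline (ranking sentences by their total TF-IDF weight), the kind of textual feature whose portability to broadcast-news transcripts is at issue here. The story sentences and summary size are invented, and this is not the paper's feature set.

```python
# Sketch only: pick the highest-scoring sentences as an extractive summary.
from sklearn.feature_extraction.text import TfidfVectorizer

story = [
    "The city council approved the new transport budget on Monday.",
    "Uh, well, you know, there was quite a lot of debate in the chamber.",
    "The plan allocates two hundred million to expand the tram network.",
    "Opponents argue the money should go to road maintenance instead.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(story)
scores = X.sum(axis=1).A1            # total TF-IDF weight per sentence

summary_size = 2
top = sorted(range(len(story)), key=lambda i: -scores[i])[:summary_size]
for i in sorted(top):                # keep the original story order
    print(story[i])
```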